SiRen: Leveraging Similar Regions for Efficient & Accurate Variant Calling
Abstract
Next-generation genomic sequencing costs are rapidly decreasing, having recently reached the $1000-per-genome barrier, a likely tipping point for widespread clinical use. However, genomic analysis techniques have failed to keep pace. In particular, the process of variant calling, or inferring a sample genome from the noisy sequencing data, introduces major computational and statistical challenges. In this work, we explore the feasibility of a hybrid approach that addresses these challenges by partitioning the genome into easier and harder regions, deploying efficient algorithms on the easier regions, and relying on more expensive and accurate technologies in the harder regions. We propose that near duplication, or similarity, in the genome is a natural signal for identifying harder regions, and then present a large-scale distributed clustering approach to identify these similar regions. We perform an extensive empirical study illustrating the effectiveness of existing variant calling algorithms on the easier regions and their contrasting struggles on the similar regions. We also confirm that the similar regions are sufficiently disjoint, thus providing the opportunity for sophisticated analysis of these regions in an embarrassingly parallel manner.
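To make the idea of using near duplication as a signal for harder regions concrete, here is a minimal toy sketch. It is not the authors' distributed clustering method: it is a brute-force, quadratic-time stand-in that flags fixed-length windows of a reference string having a near-duplicate elsewhere. The function names, the window length `k`, and the Hamming-distance threshold `max_mismatches` are all assumptions made for illustration.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def similar_windows(reference: str, k: int, max_mismatches: int) -> set[int]:
    """Return start positions of length-k windows that have a near-duplicate
    (Hamming distance <= max_mismatches) at a non-overlapping position.

    Hypothetical helper: brute force over all window pairs, O(n^2 * k).
    """
    starts = range(len(reference) - k + 1)
    flagged: set[int] = set()
    for i, j in combinations(starts, 2):
        if j - i < k:
            continue  # skip overlapping windows, which are trivially similar
        if hamming(reference[i:i + k], reference[j:j + k]) <= max_mismatches:
            flagged.update((i, j))
    return flagged


def harder_mask(reference: str, k: int, max_mismatches: int) -> list[bool]:
    """Mark every base covered by a flagged window as 'harder' (True);
    the remaining bases form the 'easier' partition (False)."""
    mask = [False] * len(reference)
    for start in similar_windows(reference, k, max_mismatches):
        for pos in range(start, start + k):
            mask[pos] = True
    return mask


if __name__ == "__main__":
    # Two copies of a 10-base segment that differ at one position,
    # separated by unrelated sequence.
    ref = "ACGTACGTAC" + "TTTTGGGGCC" + "ACGTACGAAC"
    mask = harder_mask(ref, k=10, max_mismatches=1)
    print("".join("H" if m else "e" for m in mask))
```

In a hybrid pipeline of the kind the abstract describes, positions marked as harder would be routed to a more expensive caller and the rest to an efficient one. At genome scale the pairwise comparison above is infeasible, which is presumably why the abstract refers to a large-scale distributed clustering approach instead.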
Similar resources
GW-CALL: Accurate Genome-Wide Variant Caller
The main challenge in reliable variant calling using DNA reads is to extract information from reads mappable to multiple locations on the reference genome. Conventional approaches ignore these reads and rely on reads mappable uniquely to the reference genome. These approaches fail to perform satisfactorily in variant calling within repeat regions which are abundant in many species including hom...
A statistical variant calling approach from pedigree information and local haplotyping with phase informative reads
MOTIVATION: Variant calling from genome-wide sequencing data is essential for the analysis of disease-causing mutations and elucidation of disease mechanisms. However, variant calling in low coverage regions is difficult due to sequence read errors and mapping errors. Hence, variant calling approaches that are robust to low coverage data are demanded. RESULTS: We propose a new variant calling a...
Variant Callers for Next-Generation Sequencing Data: A Comparison Study
Next generation sequencing (NGS) has been leading the genetic study of human disease into an era of unprecedented productivity. Many bioinformatics pipelines have been developed to call variants from NGS data. The performance of these pipelines depends crucially on the variant caller used and on the calling strategies implemented. We studied the performance of four prevailing callers, SAMtools,...
Leveraging Identity-by-Descent for Accurate Genotype Inference in Family Sequencing Data
Sequencing family DNA samples provides an attractive alternative to population based designs to identify rare variants associated with human disease due to the enrichment of causal variants in pedigrees. Previous studies showed that genotype calling accuracy can be improved by modeling family relatedness compared to standard calling algorithms. Current family-based variant calling methods use s...
Gappy Total Recaller: Efficient Algorithms and Data Structures for Accurate Transcriptomics
Understanding complex mammalian biology depends crucially on our ability to define a precise map of all the transcripts encoded in a genome, and to measure their relative abundances. A promising assay depends on RNASeq approaches, which builds on next generation sequencing pipelines capable of interrogating cDNAs extracted from a cell. The underlying pipeline starts with base-calling, collect t...